[VL] Enable native Parquet write for complex types (Struct/Array/Map)#11788
[VL] Enable native Parquet write for complex types (Struct/Array/Map)#11788Zouxxyy wants to merge 5 commits intoapache:mainfrom
Conversation
|
Run Gluten Clickhouse CI on x86 |
|
Write Validation Chain (spark3.4+) Generated-by: Kiro (Claude Opus 4.6) |
There was a problem hiding this comment.
Pull request overview
This PR enables Velox’s native Parquet write path to support complex Spark SQL types (Struct/Array/Map) by removing earlier type-gating and adjusting the Velox write validation to allow nested types for Parquet.
Changes:
- Removes the schema-based “native write supported” gate (
supportNativeWrite) and relies on the WriteFiles validation path instead. - Updates Velox write validation to allow Parquet
StructType(still blocksYearMonthIntervalType) and reorders validation checks. - Simplifies Delta Parquet native-writability checks and adds new Velox Parquet write tests for complex/nested types.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| gluten-substrait/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenWriterColumnarRules.scala | Removes schema gate before enabling native write properties/adaptor injection. |
| gluten-substrait/src/main/scala/org/apache/gluten/backendsapi/BackendSettingsApi.scala | Deletes supportNativeWrite from the backend settings API. |
| backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala | Allows Parquet struct types; refactors/reorders native write validation chain. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/VeloxParquetWriteSuite.scala | Adds native Parquet write coverage for struct/array/map and nested struct. |
| backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/files/GlutenDeltaFileFormatWriter.scala | Removes dependency on deleted Parquet companion helper; forces native-writable flag. |
| backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/GlutenParquetFileFormat.scala | Removes fallback branch + companion object; always uses Gluten Parquet OutputWriterFactory. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| case rc @ DataWritingCommandExec(cmd, child) => | ||
| // The same thread can set these properties in the last query submission. | ||
| val format = | ||
| if ( | ||
| BackendsApiManager.getSettings.supportNativeWrite(child.schema.fields) && | ||
| BackendsApiManager.getSettings.enableNativeWriteFiles() | ||
| ) { | ||
| if (BackendsApiManager.getSettings.enableNativeWriteFiles()) { | ||
| getNativeFormat(cmd) | ||
| } else { |
There was a problem hiding this comment.
Not a real issue. On Spark 3.4+, NativeWritePostRule only injects FakeRowAdaptor — the actual native/fallback decision is made by WriteFilesExecTransformer.doValidateInternal, which already rejects constant complex types. When validation fails, WriteFilesExec stays vanilla and FakeRowAdaptor is a harmless pass-through.
Also, #11787 (already merged) restricts NativeWritePostRule to Spark 3.3 only, so this code path is no longer reachable on 3.4+.
backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/GlutenParquetFileFormat.scala
Outdated
Show resolved
Hide resolved
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
ce64e75 to
3d5d937
Compare
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
CC @jackylee-ch @jinchengchenghh @zhztheplayer for a look, thanks. For reference, the Iceberg write path does not restrict complex types (Struct/Array/Map) at the format level too. |
| enableSuite[GlutenVariantSuite] | ||
| // TODO: Velox parquet writer marks all struct fields as OPTIONAL (nullable), | ||
| // but Spark's variant type requires REQUIRED fields. Needs Velox-side fix. | ||
| .exclude("SPARK-47546: invalid variant binary") |
There was a problem hiding this comment.
It appears I don't have permission to edit this. Perhaps I can modify it in the future—I actually have experience working with variants.
There was a problem hiding this comment.
Could you add sub issue to it?
jackylee-ch
left a comment
There was a problem hiding this comment.
Basically looks good to me. Left few comments
backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala
Show resolved
Hide resolved
backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala
Show resolved
Hide resolved
backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala
Outdated
Show resolved
Hide resolved
| field.dataType match { | ||
| case _: StructType => Some("StructType") | ||
| case _: ArrayType => Some("ArrayType") | ||
| case _: MapType => Some("MapType") |
There was a problem hiding this comment.
Do we have any native test cases for HiveFileFormat's StructType, ArrayType, and MapType?
There was a problem hiding this comment.
thanks for the review, add the test
|
Run Gluten Clickhouse CI on x86 |
What changes are proposed in this pull request?
Enable native Parquet write for complex types (Struct/Array/Map) in Velox backend.
Velox's parquet writer converts vectors to Arrow then writes via Arrow's Parquet writer, which natively supports nested types. The previous Scala-side type restrictions were unnecessary.
Changes:
supportNativeWritegate — no longer needed sincesupportWriteFilesExechandles validationvalidateDataTypesfor Parquet (onlyYearMonthIntervalTyperemains blocked, as Arrow has no mapping for it)validateDataTypesrecursively check nested types forYearMonthIntervalTypeNote:
How was this patch tested?
New tests in
VeloxParquetWriteSuite. Existing tests pass.Was this patch authored or co-authored using generative AI tooling?
Cooperated with: Kiro (Claude Opus 4.6)